Data Cleaning

Data

df <- read.csv("simulated_diabetes.csv")

str(df)
## 'data.frame':    5000 obs. of  18 variables:
##  $ age                               : int  18 53 41 40 83 82 29 19 24 66 ...
##  $ gender                            : chr  "Female" "Male" "Male" "Male" ...
##  $ smoking_status                    : chr  "Former" "Never" "Former" "Never" ...
##  $ physical_activity_minutes_per_week: int  80 66 20 106 45 123 103 63 207 211 ...
##  $ diet_score                        : num  6.3 6.3 9.3 7.8 7.6 5.8 6.2 6 3.7 6.7 ...
##  $ sleep_hours_per_day               : num  9 7.6 8.5 6.8 5.7 8.9 6.2 5.3 7.3 6.8 ...
##  $ screen_time_hours_per_day         : num  8.3 8.9 8.9 0.5 5 7.3 4.7 7.3 10.5 5.4 ...
##  $ family_history_diabetes           : int  0 1 0 0 1 0 0 0 0 1 ...
##  $ bmi                               : num  21.9 25 28.2 29.3 31.1 26.4 32.1 24.6 32.2 27.1 ...
##  $ systolic_bp                       : int  90 127 118 91 140 143 95 110 114 116 ...
##  $ diastolic_bp                      : int  82 65 80 90 85 84 80 59 65 90 ...
##  $ heart_rate                        : int  82 65 85 77 71 61 73 66 69 60 ...
##  $ cholesterol_total                 : int  141 184 211 222 196 173 154 214 180 240 ...
##  $ hdl_cholesterol                   : int  46 73 39 59 31 51 75 72 50 63 ...
##  $ ldl_cholesterol                   : int  64 81 139 142 136 78 50 122 114 153 ...
##  $ triglycerides                     : int  175 169 167 31 172 172 161 59 128 181 ...
##  $ hba1c                             : num  6.85 5.84 5.37 7.36 5.7 8.43 6.53 6.7 6.32 7.35 ...
##  $ diagnosed_diabetes                : int  1 0 0 1 0 1 1 1 1 1 ...
head(df)
##   age gender smoking_status physical_activity_minutes_per_week diet_score
## 1  18 Female         Former                                 80        6.3
## 2  53   Male          Never                                 66        6.3
## 3  41   Male         Former                                 20        9.3
## 4  40   Male          Never                                106        7.8
## 5  83   Male          Never                                 45        7.6
## 6  82 Female          Never                                123        5.8
##   sleep_hours_per_day screen_time_hours_per_day family_history_diabetes  bmi
## 1                 9.0                       8.3                       0 21.9
## 2                 7.6                       8.9                       1 25.0
## 3                 8.5                       8.9                       0 28.2
## 4                 6.8                       0.5                       0 29.3
## 5                 5.7                       5.0                       1 31.1
## 6                 8.9                       7.3                       0 26.4
##   systolic_bp diastolic_bp heart_rate cholesterol_total hdl_cholesterol
## 1          90           82         82               141              46
## 2         127           65         65               184              73
## 3         118           80         85               211              39
## 4          91           90         77               222              59
## 5         140           85         71               196              31
## 6         143           84         61               173              51
##   ldl_cholesterol triglycerides hba1c diagnosed_diabetes
## 1              64           175  6.85                  1
## 2              81           169  5.84                  0
## 3             139           167  5.37                  0
## 4             142            31  7.36                  1
## 5             136           172  5.70                  0
## 6              78           172  8.43                  1
dim(df)
## [1] 5000   18
# Check for missing values
sum(is.na(df))
## [1] 0
# Check the ranges of continuous variables
ranges <- sapply(df[, sapply(df, is.numeric)], 
                 function(x) c(min = min(x, na.rm = TRUE),
                               max = max(x, na.rm = TRUE)))
t(ranges)
##                                      min    max
## age                                 18.0  90.00
## physical_activity_minutes_per_week   2.0 651.00
## diet_score                           0.0  10.00
## sleep_hours_per_day                  3.0  10.00
## screen_time_hours_per_day            0.5  15.00
## family_history_diabetes              0.0   1.00
## bmi                                 15.0  38.90
## systolic_bp                         90.0 167.00
## diastolic_bp                        50.0 107.00
## heart_rate                          40.0  99.00
## cholesterol_total                  100.0 310.00
## hdl_cholesterol                     20.0  96.00
## ldl_cholesterol                     50.0 263.00
## triglycerides                       30.0 278.00
## hba1c                                4.0   9.51
## diagnosed_diabetes                   0.0   1.00

Comment: The dataset contains 5000 observations with no missing values. All continuous variables fall within expected physiological or behavioral ranges, indicating there are no obvious data entry errors or implausible values. The structure and initial rows confirm that variable types were correctly loaded, and the dataset is ready for preprocessing and modeling.

String Variables

# Convert gender, smoking_status, family_history_diabetes, and diagnosed_diabetes as factor variables
df <- df %>% mutate(
  gender = as.factor(gender),
  smoking_status = as.factor(smoking_status),
  family_history_diabetes = as.factor(family_history_diabetes),
  diagnosed_diabetes = as.factor(diagnosed_diabetes)
)

# Check the factor variables
table(df$gender)
## 
## Female   Male  Other 
##   2529   2390     81
table(df$smoking_status)
## 
## Current  Former   Never 
##     964     995    3041
table(df$family_history_diabetes)
## 
##    0    1 
## 3915 1085
table(df$diagnosed_diabetes)
## 
##    0    1 
## 2064 2936

Comment: String variables were successfully converted to factors, and the resulting category distributions look appropriate. No unexpected or empty categories were detected, so these variables are now properly formatted for modeling. Gender includes a small “Other” category, which may reflect participants who do not identify strictly as male or female and choose to report a nonbinary or alternative gender identity.

Continuous Variables

summary(df)
##       age           gender     smoking_status
##  Min.   :18.00   Female:2529   Current: 964  
##  1st Qu.:39.00   Male  :2390   Former : 995  
##  Median :50.00   Other :  81   Never  :3041  
##  Mean   :49.75                               
##  3rd Qu.:61.00                               
##  Max.   :90.00                               
##  physical_activity_minutes_per_week   diet_score     sleep_hours_per_day
##  Min.   :  2.0                      Min.   : 0.000   Min.   : 3.000     
##  1st Qu.: 57.0                      1st Qu.: 4.700   1st Qu.: 6.300     
##  Median : 98.0                      Median : 6.000   Median : 7.000     
##  Mean   :118.4                      Mean   : 5.943   Mean   : 7.022     
##  3rd Qu.:158.0                      3rd Qu.: 7.200   3rd Qu.: 7.800     
##  Max.   :651.0                      Max.   :10.000   Max.   :10.000     
##  screen_time_hours_per_day family_history_diabetes      bmi       
##  Min.   : 0.500            0:3915                  Min.   :15.00  
##  1st Qu.: 4.300            1:1085                  1st Qu.:23.20  
##  Median : 5.900                                    Median :25.65  
##  Mean   : 5.958                                    Mean   :25.63  
##  3rd Qu.: 7.600                                    3rd Qu.:28.10  
##  Max.   :15.000                                    Max.   :38.90  
##   systolic_bp     diastolic_bp      heart_rate    cholesterol_total
##  Min.   : 90.0   Min.   : 50.00   Min.   :40.00   Min.   :100.0    
##  1st Qu.:105.0   1st Qu.: 70.00   1st Qu.:64.00   1st Qu.:164.0    
##  Median :116.0   Median : 75.00   Median :70.00   Median :186.0    
##  Mean   :115.7   Mean   : 75.32   Mean   :69.69   Mean   :185.9    
##  3rd Qu.:125.0   3rd Qu.: 81.00   3rd Qu.:76.00   3rd Qu.:208.0    
##  Max.   :167.0   Max.   :107.00   Max.   :99.00   Max.   :310.0    
##  hdl_cholesterol ldl_cholesterol triglycerides     hba1c      
##  Min.   :20.00   Min.   : 50.0   Min.   : 30   Min.   :4.000  
##  1st Qu.:47.00   1st Qu.: 78.0   1st Qu.: 91   1st Qu.:5.940  
##  Median :54.00   Median :102.0   Median :121   Median :6.500  
##  Mean   :53.99   Mean   :103.1   Mean   :121   Mean   :6.501  
##  3rd Qu.:61.00   3rd Qu.:127.0   3rd Qu.:151   3rd Qu.:7.060  
##  Max.   :96.00   Max.   :263.0   Max.   :278   Max.   :9.510  
##  diagnosed_diabetes
##  0:2064            
##  1:2936            
##                    
##                    
##                    
## 
# histograms for numeric columns to visually assess distributions
df %>%
  select_if(is.numeric) %>%
  gather(variable, value) %>%
  ggplot(aes(x = value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ variable, scales = "free", ncol = 2) +
  labs(title = "Distributions of Continuous Variables")

Comment: I checked the continuous variables using summary statistics and visualized their distributions with histograms. No variables appeared to be incorrectly coded as numeric indicators, and the observed ranges were consistent with what would be expected for an adult population. Several variables, such as physical activity minutes per week, screen time hours per day, and triglycerides, showed noticeable right-skewness, which is typical for behavioral and biomarker data. Other variables, including heart rate and blood pressure, displayed more symmetric distributions. Overall, no unusual or implausible values were detected, and the continuous variables appear clean and ready for analysis.

Training Validation Test Sets

# training+validation = 75% vs. testing = 25%
set.seed(123)
trainval <- createDataPartition(y=df$diagnosed_diabetes, p=0.75, list=FALSE)

dftrval <- df[trainval,]
dftest <- df[-trainval,]

# training = 50% vs. Validation = 25%
set.seed(123)
train <- createDataPartition(y=dftrval$diagnosed_diabetes, p=2/3, list=FALSE)
dftr <- dftrval[train,]
dfval <- dftrval[-train,]

# Verify the data
dim(df); dim(dftrval); dim(dftest); dim(dftr); dim(dfval)
## [1] 5000   18
## [1] 3750   18
## [1] 1250   18
## [1] 2500   18
## [1] 1250   18

EDA

summary(df)
##       age           gender     smoking_status
##  Min.   :18.00   Female:2529   Current: 964  
##  1st Qu.:39.00   Male  :2390   Former : 995  
##  Median :50.00   Other :  81   Never  :3041  
##  Mean   :49.75                               
##  3rd Qu.:61.00                               
##  Max.   :90.00                               
##  physical_activity_minutes_per_week   diet_score     sleep_hours_per_day
##  Min.   :  2.0                      Min.   : 0.000   Min.   : 3.000     
##  1st Qu.: 57.0                      1st Qu.: 4.700   1st Qu.: 6.300     
##  Median : 98.0                      Median : 6.000   Median : 7.000     
##  Mean   :118.4                      Mean   : 5.943   Mean   : 7.022     
##  3rd Qu.:158.0                      3rd Qu.: 7.200   3rd Qu.: 7.800     
##  Max.   :651.0                      Max.   :10.000   Max.   :10.000     
##  screen_time_hours_per_day family_history_diabetes      bmi       
##  Min.   : 0.500            0:3915                  Min.   :15.00  
##  1st Qu.: 4.300            1:1085                  1st Qu.:23.20  
##  Median : 5.900                                    Median :25.65  
##  Mean   : 5.958                                    Mean   :25.63  
##  3rd Qu.: 7.600                                    3rd Qu.:28.10  
##  Max.   :15.000                                    Max.   :38.90  
##   systolic_bp     diastolic_bp      heart_rate    cholesterol_total
##  Min.   : 90.0   Min.   : 50.00   Min.   :40.00   Min.   :100.0    
##  1st Qu.:105.0   1st Qu.: 70.00   1st Qu.:64.00   1st Qu.:164.0    
##  Median :116.0   Median : 75.00   Median :70.00   Median :186.0    
##  Mean   :115.7   Mean   : 75.32   Mean   :69.69   Mean   :185.9    
##  3rd Qu.:125.0   3rd Qu.: 81.00   3rd Qu.:76.00   3rd Qu.:208.0    
##  Max.   :167.0   Max.   :107.00   Max.   :99.00   Max.   :310.0    
##  hdl_cholesterol ldl_cholesterol triglycerides     hba1c      
##  Min.   :20.00   Min.   : 50.0   Min.   : 30   Min.   :4.000  
##  1st Qu.:47.00   1st Qu.: 78.0   1st Qu.: 91   1st Qu.:5.940  
##  Median :54.00   Median :102.0   Median :121   Median :6.500  
##  Mean   :53.99   Mean   :103.1   Mean   :121   Mean   :6.501  
##  3rd Qu.:61.00   3rd Qu.:127.0   3rd Qu.:151   3rd Qu.:7.060  
##  Max.   :96.00   Max.   :263.0   Max.   :278   Max.   :9.510  
##  diagnosed_diabetes
##  0:2064            
##  1:2936            
##                    
##                    
##                    
## 
# Histogram of numeric variables
hist(df$age)

hist(df$physical_activity_minutes_per_week)

hist(df$diet_score)

hist(df$sleep_hours_per_day)

hist(df$screen_time_hours_per_day)

hist(df$bmi)

hist(df$systolic_bp)

hist(df$diastolic_bp)

hist(df$heart_rate)

hist(df$cholesterol_total)

hist(df$hdl_cholesterol)

hist(df$ldl_cholesterol)

hist(df$triglycerides)

hist(df$hba1c)

# List of numeric variables
num_vars <- c("age", "physical_activity_minutes_per_week", "diet_score", "sleep_hours_per_day", "screen_time_hours_per_day", "bmi", "systolic_bp", "diastolic_bp", "heart_rate", "cholesterol_total", "hdl_cholesterol", "ldl_cholesterol", "triglycerides", "hba1c")

# Correlation among numeric variables
cor_matrix <- cor(df[num_vars], use = "pairwise.complete.obs")
round(cor_matrix[1:6, 1:6], 2)
##                                     age physical_activity_minutes_per_week
## age                                1.00                               0.00
## physical_activity_minutes_per_week 0.00                               1.00
## diet_score                         0.00                               0.00
## sleep_hours_per_day                0.00                              -0.03
## screen_time_hours_per_day          0.01                               0.01
## bmi                                0.09                              -0.07
##                                    diet_score sleep_hours_per_day
## age                                      0.00                0.00
## physical_activity_minutes_per_week       0.00               -0.03
## diet_score                               1.00                0.01
## sleep_hours_per_day                      0.01                1.00
## screen_time_hours_per_day               -0.01               -0.01
## bmi                                     -0.19               -0.01
##                                    screen_time_hours_per_day   bmi
## age                                                     0.01  0.09
## physical_activity_minutes_per_week                      0.01 -0.07
## diet_score                                             -0.01 -0.19
## sleep_hours_per_day                                    -0.01 -0.01
## screen_time_hours_per_day                               1.00 -0.01
## bmi                                                    -0.01  1.00
# Graphical summaries
par(mfrow = c(2, 3))
for (v in num_vars) {
  hist(df[[v]], main = v, col = "lightblue", border = "white")
}

# Pairwise scatterplot matrix
pairs(df[, num_vars[1:5]], main = "Pairwise scatterplot matrix")

# Summary of categorical variables
summary(df[, c("gender", "smoking_status", "family_history_diabetes", "diagnosed_diabetes")])
##     gender     smoking_status family_history_diabetes diagnosed_diabetes
##  Female:2529   Current: 964   0:3915                  0:2064            
##  Male  :2390   Former : 995   1:1085                  1:2936            
##  Other :  81   Never  :3041
cat_vars <- df[, sapply(df, is.character) | sapply(df, is.factor)]

lapply(cat_vars, function(x) {
  list(
    table = table(x),
    proportion = prop.table(table(x))
  )
})
## $gender
## $gender$table
## x
## Female   Male  Other 
##   2529   2390     81 
## 
## $gender$proportion
## x
## Female   Male  Other 
## 0.5058 0.4780 0.0162 
## 
## 
## $smoking_status
## $smoking_status$table
## x
## Current  Former   Never 
##     964     995    3041 
## 
## $smoking_status$proportion
## x
## Current  Former   Never 
##  0.1928  0.1990  0.6082 
## 
## 
## $family_history_diabetes
## $family_history_diabetes$table
## x
##    0    1 
## 3915 1085 
## 
## $family_history_diabetes$proportion
## x
##     0     1 
## 0.783 0.217 
## 
## 
## $diagnosed_diabetes
## $diagnosed_diabetes$table
## x
##    0    1 
## 2064 2936 
## 
## $diagnosed_diabetes$proportion
## x
##      0      1 
## 0.4128 0.5872
ggplot(df, aes(x = gender)) +
  geom_bar() +
  coord_flip() +  
  theme_minimal() +
  labs(title = "Distribution of gender",
       x = "gender",
       y = "Count")

ggplot(df, aes(x = smoking_status)) +
  geom_bar() +
  coord_flip() +
  theme_minimal()

  labs(title = "Distribution of smoking status",
       x = "smoking_status",
       y = "Count")
## <ggplot2::labels> List of 3
##  $ x    : chr "smoking_status"
##  $ y    : chr "Count"
##  $ title: chr "Distribution of smoking status"
ggplot(df, aes(x = factor(family_history_diabetes))) +
  geom_bar() +
  theme_minimal()

  labs(title = "Distribution of family history of diabetes",
       x = "family_history_diabetes",
       y = "Count")
## <ggplot2::labels> List of 3
##  $ x    : chr "family_history_diabetes"
##  $ y    : chr "Count"
##  $ title: chr "Distribution of family history of diabetes"
ggplot(df, aes(x = factor(diagnosed_diabetes))) +
  geom_bar() +
  theme_minimal()

  labs(title = "Distribution of diagnosed diabetes",
       x = "diagnosed_diabetes",
       y = "Count")
## <ggplot2::labels> List of 3
##  $ x    : chr "diagnosed_diabetes"
##  $ y    : chr "Count"
##  $ title: chr "Distribution of diagnosed diabetes"

age: Age ranges from 18 to 90, with a median of 50. The distribution is roughly symmetric and centered in mid-adulthood, indicating broad representation across the adult lifespan.

physical_activity_minutes_per_week: Weekly physical activity ranges from 2 to 651 minutes, with a strong right-skew driven by a small number of highly active individuals. Most participants cluster between 50 and 200 minutes per week.

diet_score: Diet score ranges from 0 to 10, with a median near 6. The distribution is approximately symmetric, suggesting a balanced spread of dietary quality levels across the sample.

sleep_hours_per_day: Sleep hours range from 3 to 10, with a median of 7. The distribution is bell-shaped, indicating most individuals report 6–8 hours per night.

screen_time_hours_per_day: Screen time ranges from 0.5 to 15 hours per day and shows right-skewness. While most participants report around 4–7 hours, a smaller group reports much higher use.

bmi: BMI ranges from 15.0 to 38.9, with a median of 25.7. The distribution is moderately symmetric and centered in the mid-20s, consistent with a population spanning normal weight to obesity.

systolic_bp: Systolic blood pressure ranges from 90 to 167, with a median of 116. The distribution is right-skewed, reflecting the presence of individuals with elevated blood pressure.

diastolic_bp: Diastolic blood pressure ranges from 50 to 107, with a median of 75. The distribution is roughly symmetric, centered within the typical normal-to-borderline range.

heart_rate: Heart rate ranges from 40 to 99 beats per minute, with a median of 70. The distribution appears symmetric and aligns with expected resting heart rate values.

cholesterol_total: Total cholesterol ranges from 100 to 310, with a median of 186. The distribution is slightly right-skewed, with most values falling within standard clinical ranges.

hdl_cholesterol: HDL cholesterol ranges from 20 to 96, with a median of 54. Values follow a symmetric distribution centered around typical HDL levels.

ldl_cholesterol: LDL cholesterol ranges from 50 to 263, with a median of 102. Although mostly centered in the normal range, the distribution shows a right tail reflecting individuals with elevated LDL.

triglycerides: Triglycerides range from 30 to 278, showing right-skewness. Most values fall around 100–150, but a subset of participants contributes to a long upper tail.

hba1c: HbA1c ranges from 4.0 to 9.5, with a median of 6.5. The distribution is slightly right-skewed, reflecting a mix of normoglycemic individuals and those with elevated glucose levels.

gender: Gender includes three categories: Female (50.6 percent), Male (47.8 percent), and a small Other group (1.6 percent). The distribution is nearly balanced between males and females, with a very small proportion identifying as Other.

smoking_status: Smoking status consists of Current (19.3 percent), Former (19.9 percent), and Never smokers (60.8 percent). Most participants report never smoking, while current and former smokers are represented in similar proportions.

family_history_diabetes: Family history of diabetes is coded as 0 or 1, with 78.3 percent reporting no family history and 21.7 percent reporting a positive history. This distribution suggests that family history of diabetes is present in a meaningful minority of the sample.

diagnosed_diabetes: Diagnosed diabetes is also binary, with 41.3 percent reporting no diagnosis and 58.7 percent reporting having been diagnosed. The relatively high proportion of diagnosed diabetes reflects the simulated dataset’s design to include a substantial number of individuals with diabetes.

Tree Based / Random Forest / Logistic Regression / Multiple Testing

Tree Based Prediction

# Build tree (exclude hba1c)
set.seed(123)
diab.tree <- tree(diagnosed_diabetes ~ . - hba1c, data = dftr)

summary(diab.tree)
## 
## Classification tree:
## tree(formula = diagnosed_diabetes ~ . - hba1c, data = dftr)
## Variables actually used in tree construction:
## [1] "family_history_diabetes" "age"                    
## Number of terminal nodes:  3 
## Residual mean deviance:  1.301 = 3250 / 2497 
## Misclassification error rate: 0.4084 = 1021 / 2500
# Plot tree
plot(diab.tree)
text(diab.tree, pretty = 1, cex = 0.6)

# Predict on validation set
diab.pred <- predict(diab.tree, newdata = dfval, type = "class")

# Validation accuracy
conf <- table(Predicted = diab.pred, Actual = dfval$diagnosed_diabetes)
conf
##          Actual
## Predicted   0   1
##         0 351 372
##         1 165 362
val_acc <- sum(diab.pred == dfval$diagnosed_diabetes) / nrow(dfval)
val_acc
## [1] 0.5704

Comment: The final classification tree contained 3 terminal nodes. The first node represents individuals with no family history of diabetes and age less than 60.5 years, and this node predicts no diagnosed diabetes (class 0). The second node contains individuals with no family history but age 60.5 years or older, and it predicts diagnosed diabetes (class 1). The third node includes all individuals with a positive family history of diabetes, regardless of age, and also predicts diagnosed diabetes (class 1). When applied to the validation set, the tree achieved a classification accuracy of 0.5704, corresponding to approximately 57 percent correctly classified observations.

Random Forest Prediction

set.seed(123)

# Random forest with mtry = 4
rf4 <- randomForest(
  diagnosed_diabetes ~ . - hba1c,
  data = dftr,
  mtry = 4,
  importance = TRUE
)

# Predict validation set
pred4 <- predict(rf4, newdata = dfval, type = "class")
acc4 <- mean(pred4 == dfval$diagnosed_diabetes)
acc4
## [1] 0.596
# Random forest with mtry = 16
rf16 <- randomForest(
  diagnosed_diabetes ~ . - hba1c,
  data = dftr,
  mtry = 16,
  importance = TRUE
)

# Predict validation set
pred16 <- predict(rf16, newdata = dfval, type = "class")
acc16 <- mean(pred16 == dfval$diagnosed_diabetes)
acc16
## [1] 0.592
# Variable importance plots
varImpPlot(rf4, main = "Variable Importance (mtry = 4)")

varImpPlot(rf16, main = "Variable Importance (mtry = 16)")

# Numeric importance values
importance(rf4)
##                                             0           1 MeanDecreaseAccuracy
## age                                 2.9106674 10.01016800           10.4292845
## gender                             -0.5101661 -0.05574139           -0.4060277
## smoking_status                     -2.5353483 -1.41231167           -2.6761152
## physical_activity_minutes_per_week  1.8022128  5.23803107            4.8724241
## diet_score                          2.9675872 -1.44963830            0.9768614
## sleep_hours_per_day                 1.8914093 -1.85860346           -0.1477519
## screen_time_hours_per_day          -1.1728497 -2.18171737           -2.4716202
## family_history_diabetes            23.1306787 15.27501332           25.6431233
## bmi                                 1.7178127  6.86758391            6.3547409
## systolic_bp                         3.4051658  2.81477710            4.5092053
## diastolic_bp                        2.4917038 -2.21383496           -0.2018646
## heart_rate                         -2.1117244  2.41196324            0.4922880
## cholesterol_total                   1.1021219  4.92255555            5.0727158
## hdl_cholesterol                     4.7458061  2.56333285            5.1525057
## ldl_cholesterol                    -0.8124204  2.17085413            1.3052503
## triglycerides                       0.3497661  5.93905969            4.7989056
##                                    MeanDecreaseGini
## age                                        98.99092
## gender                                     15.86020
## smoking_status                             24.25933
## physical_activity_minutes_per_week        101.41047
## diet_score                                 86.28352
## sleep_hours_per_day                        82.14977
## screen_time_hours_per_day                  87.99840
## family_history_diabetes                    40.34213
## bmi                                        97.66736
## systolic_bp                                86.24187
## diastolic_bp                               75.02284
## heart_rate                                 75.66762
## cholesterol_total                          84.65377
## hdl_cholesterol                            84.09817
## ldl_cholesterol                            76.68814
## triglycerides                              93.80277
importance(rf16)
##                                             0           1 MeanDecreaseAccuracy
## age                                 1.7520901 13.92104130         11.986278216
## gender                              0.6973503  0.73510727          1.007488233
## smoking_status                     -1.1983922  2.83844642          1.493587355
## physical_activity_minutes_per_week  3.4126135  5.99460392          6.719832534
## diet_score                         -0.2619317 -0.57290168         -0.674957426
## sleep_hours_per_day                 3.2977540 -2.65110246          0.216079738
## screen_time_hours_per_day          -3.2127876 -1.95705130         -3.657706695
## family_history_diabetes            26.2562005 16.63355898         28.485102350
## bmi                                 3.1482750  7.15717096          7.672483461
## systolic_bp                         2.2698859  5.39478094          5.787445505
## diastolic_bp                        1.5638384 -0.04137306          1.027256148
## heart_rate                         -2.2026319  2.05696996         -0.004237676
## cholesterol_total                   0.2437571  6.36585151          6.090996614
## hdl_cholesterol                     0.4943518  3.88314897          3.487239232
## ldl_cholesterol                    -3.5861710  4.59114302          1.688794765
## triglycerides                      -2.6978055  5.42416641          2.268822911
##                                    MeanDecreaseGini
## age                                       102.42393
## gender                                     12.59000
## smoking_status                             20.13815
## physical_activity_minutes_per_week        107.43660
## diet_score                                 88.43551
## sleep_hours_per_day                        83.88539
## screen_time_hours_per_day                  88.89550
## family_history_diabetes                    42.73190
## bmi                                       103.09155
## systolic_bp                                83.94659
## diastolic_bp                               75.35478
## heart_rate                                 76.08679
## cholesterol_total                          79.08438
## hdl_cholesterol                            86.30698
## ldl_cholesterol                            65.64828
## triglycerides                              95.39584

Comment: Between the two models, mtry = 4 produced the highest validation accuracy (0.596), slightly outperforming mtry = 16 (0.592). Therefore, mtry = 4 provides the best predictive performance. Across both models, the most influential predictors were family history, metabolic, and lifestyle variables. Variables such as family_history_diabetes, physical_activity_minutes_per_week, BMI, age, triglycerides, screen_time_hours_per_day, diet_score, systolic_bp, HDL cholesterol, and total cholesterol consistently showed high importance based on both MeanDecreaseAccuracy and MeanDecreaseGini. The paired importance plots confirm that diabetes diagnosis in this dataset is largely driven by a combination of metabolic health factors and behavioral patterns. The predicted validation accuracies were 0.596 for mtry = 4 and 0.592 for mtry = 16. Thus, the random forest with mtry = 4 achieves the best predictive accuracy.

Logistic Regression and Multiple Testing

## Full logistic regression (training set)
glm_full <- glm(
  diagnosed_diabetes ~ . - hba1c,
  data   = dftr,
  family = binomial
)

summary(glm_full)
## 
## Call:
## glm(formula = diagnosed_diabetes ~ . - hba1c, family = binomial, 
##     data = dftr)
## 
## Coefficients:
##                                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                        -1.9245329  0.8510000  -2.261   0.0237 *  
## age                                 0.0175330  0.0034249   5.119 3.07e-07 ***
## genderMale                          0.1087732  0.0858735   1.267   0.2053    
## genderOther                         0.2413348  0.3451864   0.699   0.4845    
## smoking_statusFormer               -0.0302847  0.1385412  -0.219   0.8270    
## smoking_statusNever                -0.0589888  0.1140289  -0.517   0.6049    
## physical_activity_minutes_per_week -0.0022916  0.0005229  -4.383 1.17e-05 ***
## diet_score                         -0.0606903  0.0240223  -2.526   0.0115 *  
## sleep_hours_per_day                 0.0145636  0.0386959   0.376   0.7066    
## screen_time_hours_per_day           0.0230782  0.0175071   1.318   0.1874    
## family_history_diabetes1            1.0847880  0.1170133   9.271  < 2e-16 ***
## bmi                                 0.0278031  0.0137038   2.029   0.0425 *  
## systolic_bp                         0.0074614  0.0036527   2.043   0.0411 *  
## diastolic_bp                        0.0008221  0.0053701   0.153   0.8783    
## heart_rate                          0.0049275  0.0052336   0.942   0.3464    
## cholesterol_total                   0.0009862  0.0040608   0.243   0.8081    
## hdl_cholesterol                    -0.0103927  0.0057682  -1.802   0.0716 .  
## ldl_cholesterol                    -0.0013528  0.0040800  -0.332   0.7402    
## triglycerides                       0.0011692  0.0010700   1.093   0.2745    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3389.3  on 2499  degrees of freedom
## Residual deviance: 3162.4  on 2481  degrees of freedom
## AIC: 3200.4
## 
## Number of Fisher Scoring iterations: 4
## Extract p values (excluding intercept)
coef_tab <- coef(summary(glm_full))
pvals    <- coef_tab[-1, "Pr(>|z|)"]        # drop intercept
varnames <- names(pvals)
m        <- length(pvals)                   # number of tests

m
## [1] 18
pvals
##                                age                         genderMale 
##                       3.066612e-07                       2.052741e-01 
##                        genderOther               smoking_statusFormer 
##                       4.844624e-01                       8.269637e-01 
##                smoking_statusNever physical_activity_minutes_per_week 
##                       6.049365e-01                       1.172132e-05 
##                         diet_score                sleep_hours_per_day 
##                       1.152345e-02                       7.066496e-01 
##          screen_time_hours_per_day           family_history_diabetes1 
##                       1.874306e-01                       1.850418e-20 
##                                bmi                        systolic_bp 
##                       4.247197e-02                       4.108104e-02 
##                       diastolic_bp                         heart_rate 
##                       8.783288e-01                       3.464420e-01 
##                  cholesterol_total                    hdl_cholesterol 
##                       8.081176e-01                       7.158670e-02 
##                    ldl_cholesterol                      triglycerides 
##                       7.402129e-01                       2.744824e-01

Significant predictors

alpha <- 0.05

sig_unadj <- varnames[pvals < alpha]
sig_unadj
## [1] "age"                                "physical_activity_minutes_per_week"
## [3] "diet_score"                         "family_history_diabetes1"          
## [5] "bmi"                                "systolic_bp"

Comment: At the unadjusted α = 0.05 level, the significant predictors were age, physical_activity_minutes_per_week, diet_score, family_history_diabetes, bmi, and systolic_bp.

Significant predictors after controlling FWER with Bonferroni’s method

alpha_bonf <- alpha / m   # Bonferroni threshold

alpha_bonf
## [1] 0.002777778
sig_bonf <- varnames[pvals < alpha_bonf]
sig_bonf
## [1] "age"                                "physical_activity_minutes_per_week"
## [3] "family_history_diabetes1"

Comment: Using Bonferroni’s method to control the family-wise error rate, the significance threshold becomes α/m = 0.05/18 ≈ 0.00278. Under this more conservative cutoff, only three variables remained statistically significant: age, physical_activity_minutes_per_week, and family_history_diabetes. These predictors had p-values below the Bonferroni-adjusted threshold and therefore showed strong evidence of association with diagnosed diabetes even after controlling for multiple testing.

Significant predictors after controlling FWER with Holm’s method

# Order p values
ord        <- order(pvals)
p_sorted   <- pvals[ord]
vars_sorted <- varnames[ord]

# Holm critical values: alpha/(m - k + 1)
k_idx  <- seq_along(p_sorted)
crit_H <- alpha / (m - k_idx + 1)

# Find largest k with p_(k) <- crit_H(k)
rej_idx <- which(p_sorted <= crit_H)

if (length(rej_idx) == 0) {
  sig_holm <- character(0)
} else {
  k_max   <- max(rej_idx)
  sig_holm <- vars_sorted[1:k_max]
}

crit_H
##  [1] 0.002777778 0.002941176 0.003125000 0.003333333 0.003571429 0.003846154
##  [7] 0.004166667 0.004545455 0.005000000 0.005555556 0.006250000 0.007142857
## [13] 0.008333333 0.010000000 0.012500000 0.016666667 0.025000000 0.050000000
sig_holm
## [1] "family_history_diabetes1"           "age"                               
## [3] "physical_activity_minutes_per_week"

Comment: Using Holm’s step-down procedure to control the family-wise error rate, the p-values were first sorted and compared to the sequential Holm critical values α/(m − k + 1). Three variables met the Holm rejection criteria: age, physical_activity_minutes_per_week, and family_history_diabetes. The Holm-adjusted set was identical to the Bonferroni-adjusted set in this dataset. These variables remained significant even after applying Holm’s more powerful FWER correction, indicating strong evidence of association with diagnosed diabetes.

Significant predictors after controlling for FDR at the q=0.05 level

q      <- 0.05
k_idx  <- seq_along(p_sorted)
crit_BH <- (k_idx / m) * q

rej_idx_BH <- which(p_sorted <= crit_BH)

if (length(rej_idx_BH) == 0) {
  sig_fdr <- character(0)
} else {
  k_max_BH <- max(rej_idx_BH)
  sig_fdr  <- vars_sorted[1:k_max_BH]
}

crit_BH
##  [1] 0.002777778 0.005555556 0.008333333 0.011111111 0.013888889 0.016666667
##  [7] 0.019444444 0.022222222 0.025000000 0.027777778 0.030555556 0.033333333
## [13] 0.036111111 0.038888889 0.041666667 0.044444444 0.047222222 0.050000000
sig_fdr
## [1] "family_history_diabetes1"           "age"                               
## [3] "physical_activity_minutes_per_week"

Comment: Applying the Benjamini–Hochberg procedure to control the false discovery rate at q = 0.05, the p-values were ordered and compared to the BH critical vaules (k/m)q. Three variables met the BH rejection criterion and were therefore significant after controlling for FDR: age, physical_activity_minutes_per_week, and family_history_diabetes. As with Holm and Bonferroni, the same three predictors were retained under FDR correction.These variables had sufficiently small p-values relative to their BH thresholds, indicating that they remain statistically significant even when controlling the expected proportion of false discoveries.

Logistic regressions based on multiple testing results

# a) Unadjusted 0.05
vars_unadj <- c(
  "age",
  "diet_score",
  "bmi",
  "physical_activity_minutes_per_week",
  "family_history_diabetes",  
  "systolic_bp"
)

# b) Bonferroni
vars_bonf <- c(
  "family_history_diabetes",
  "physical_activity_minutes_per_week",
  "age"
)

# c) Holm 
vars_holm <- c(
  "family_history_diabetes",
  "physical_activity_minutes_per_week",
  "age"
)

# d) FDR 
vars_fdr <- c(
  "family_history_diabetes",
  "physical_activity_minutes_per_week",
  "age"
)

# Validation error function
logit_val_error <- function(vars, dftr, dfval, threshold = 0.5) {
  if (length(vars) == 0) {
    form <- diagnosed_diabetes ~ 1
  } else {
    form <- as.formula(
      paste("diagnosed_diabetes ~", paste(vars, collapse = " + "))
    )
  }
  
  fit <- glm(form, data = dftr, family = binomial)
  
  p_hat <- predict(fit, newdata = dfval, type = "response")
  y_hat <- ifelse(p_hat >= threshold, 1, 0)
  y_true <- ifelse(dfval$diagnosed_diabetes %in% c(1, "1"), 1, 0)
  
  acc <- mean(y_hat == y_true)
  err <- 1 - acc
  
  list(model = fit, accuracy = acc, error = err)
}

# 1) unadjusted model
res_unadj <- logit_val_error(vars_unadj, dftr, dfval)
res_unadj$accuracy
## [1] 0.6208
res_unadj$error
## [1] 0.3792
# 2) Bonferroni model
res_bonf <- logit_val_error(vars_bonf, dftr, dfval)
res_bonf$accuracy
## [1] 0.6064
res_bonf$error
## [1] 0.3936
# 3) Holm model
res_holm <- logit_val_error(vars_holm, dftr, dfval)
res_holm$accuracy
## [1] 0.6064
res_holm$error
## [1] 0.3936
# 4) FDR model
res_fdr <- logit_val_error(vars_fdr, dftr, dfval)
res_fdr$accuracy
## [1] 0.6064
res_fdr$error
## [1] 0.3936

Comment: The unadjusted model at the 0.05 level retained six variables (age, diet_score, bmi, physical_activity_minutes_per_week, family_history_diabetes, and systolic_bp), and this model achieved the highest validation accuracy at 0.6208, corresponding to a validation error of 0.3792. When applying Bonferroni’s family-wise error correction, only three variables (age, physical_activity_minutes_per_week, and family_history_diabetes) remained significant, and the reduced model produced a slightly lower validation accuracy of 0.6064, with an error rate of 0.3936. Holm’s method selected the same three predictors as Bonferroni, leading to identical validation performance with an accuracy of 0.6064 and an error of 0.3936. Lastly, controlling the false discovery rate using the Benjamini–Hochberg procedure at q = 0.05 also resulted in the same three-variable model and the same validation metrics. Overall, the model without multiple-testing adjustments provided the best predictive accuracy, while all adjusted models showed slightly higher error rates due to the more restrictive selection of predictors.

Interpretation

For the strict prediction of diagnosed_diabetes

If the goal is strictly prediction, the model that performs best on the validation set should be chosen, regardless of interpretability or variable selection simplicity. Across all models evaluated, the unadjusted logistic regression achieved the highest validation accuracy (0.6208), outperforming the decision tree, both random forest models, and all of the multiple-testing–adjusted logistic regressions. The random forest models produced similar but slightly lower predictive accuracy, and the decision tree performed worst due to its simplicity and limited structure. Therefore, based on predictive performance alone, I would choose the unadjusted logistic regression model. It uses a moderately sized set of predictors and provides the strongest out-of-sample prediction for diagnosed_diabetes among the approaches considered.

For the relationship between significant variables and diagnosed_diabetes

If the aim is to understand the relationships between predictors and diagnosed diabetes (rather than maximizing predictive accuracy), interpretability, statistical rigor, and control of false positives should be prioritized. In this case, a logistic regression model with multiple-testing correction is most appropriate. Both the Bonferroni and Holm procedures retained the same three variables (age, physical_activity_minutes_per_week, and family_history_diabetes) and these variables also remained significant under FDR control. This consistency strengthens confidence that these predictors are robustly associated with diagnosed diabetes, even after accounting for multiple comparisons. Although the reduced models do not predict as well as the unadjusted logistic model, they provide cleaner inference by limiting spurious findings. Thus, for the purpose of understanding covariate relationships, I would choose the logistic regression model using the variables that survive Holm or FDR correction, as it balances interpretability with appropriate error control.

Principal Components / Hierarchical Clustering

Principal Components/Loadings/BiPlot

## LE8 variables 
le8vars <- c(
  "diet_score",
  "physical_activity_minutes_per_week",
  "smoking_status",
  "sleep_hours_per_day",
  "bmi",
  "cholesterol_total",
  "hdl_cholesterol",
  "hba1c",
  "systolic_bp",
  "diastolic_bp"
)

df_le8 <- df[, le8vars]

## model.matrix performs one-hot encoding
X_le8 <- model.matrix(
  ~ diet_score +
    physical_activity_minutes_per_week +
    smoking_status +
    sleep_hours_per_day +
    bmi +
    cholesterol_total +
    hdl_cholesterol +
    hba1c +
    systolic_bp +
    diastolic_bp,
  data = df_le8
)[, -1]   # remove intercept column


## PCA with centering and scaling
le8.pca <- prcomp(X_le8, center = TRUE, scale. = TRUE)

## Eigenvalues and cumulative variance explained
le8_cve <- le8.pca$sdev^2
le8_pve <- le8_cve / sum(le8_cve)
le8_cum_pve <- cumsum(le8_pve)

le8_pve
##  [1] 0.15625009 0.14614162 0.09990181 0.09422061 0.08979292 0.08679881
##  [7] 0.08187005 0.08057559 0.07178823 0.05890962 0.03375065
le8_cum_pve
##  [1] 0.1562501 0.3023917 0.4022935 0.4965141 0.5863071 0.6731059 0.7549759
##  [8] 0.8355515 0.9073397 0.9662493 1.0000000
## Minimum number of PCs to exceed 75% cumulative variance
which(le8_cum_pve >= 0.75)[1]
## [1] 7
## Scree plot
par(mfrow = c(1, 2))

## Individual variance explained
plot(
  le8_pve,
  type = "b",
  xlab = "# of Components",
  ylab = "Proportion of Variance Explained",
  main = "Scree Plot"
)

## Cumulative variance explained
plot(
  le8_cum_pve,
  type = "b",
  ylim = c(0, 1),
  xlab = "# of Components",
  ylab = "Cumulative Proportion Explained",
  main = "Cumulative Variance Explained"
)
abline(h = 0.75, lty = 2)

par(mfrow = c(1, 1))

## Full loadings
le8.pca$rotation
##                                             PC1         PC2          PC3
## diet_score                         -0.219997571 -0.14437531 -0.047774867
## physical_activity_minutes_per_week -0.122875936 -0.01992516 -0.606440496
## smoking_statusFormer                0.259750879 -0.65628548 -0.017108451
## smoking_statusNever                -0.289666467  0.64299188 -0.008377308
## sleep_hours_per_day                 0.002983006 -0.01513587  0.347210872
## bmi                                 0.507718867  0.21718814  0.011811332
## cholesterol_total                   0.349584735  0.14189226 -0.322501416
## hdl_cholesterol                    -0.273310582 -0.09777385 -0.325974862
## hba1c                               0.264492952  0.13789308  0.415735982
## systolic_bp                         0.411923102  0.17055129 -0.231705159
## diastolic_bp                        0.301318306  0.09702781 -0.268513035
##                                             PC4          PC5         PC6
## diet_score                         -0.628088406 -0.002885865  0.57230478
## physical_activity_minutes_per_week  0.302153182 -0.341226858  0.21217516
## smoking_statusFormer               -0.005710473  0.021615200 -0.06412873
## smoking_statusNever                -0.060988183  0.003276334 -0.04250832
## sleep_hours_per_day                -0.262182489 -0.850195266 -0.27190115
## bmi                                 0.203346084 -0.022774346 -0.01320100
## cholesterol_total                  -0.240187619 -0.168956872  0.13380963
## hdl_cholesterol                    -0.399408371  0.221553707 -0.59503867
## hba1c                              -0.217921375  0.282851830  0.06010799
## systolic_bp                        -0.315225089 -0.004901813  0.13696510
## diastolic_bp                       -0.190507828  0.045449273 -0.39124996
##                                             PC7         PC8         PC9
## diet_score                          0.304383247  0.08445555 -0.04812606
## physical_activity_minutes_per_week  0.034945347 -0.58012593  0.05070471
## smoking_statusFormer               -0.053913946 -0.02450954 -0.01196067
## smoking_statusNever                 0.001763677  0.02956981 -0.01687756
## sleep_hours_per_day                -0.015238898 -0.09382845 -0.05153837
## bmi                                 0.032183307  0.16650212 -0.23267613
## cholesterol_total                  -0.395637588  0.29675257  0.63325993
## hdl_cholesterol                    -0.331608418 -0.09928898 -0.08748093
## hba1c                              -0.096623024 -0.70791549  0.29567385
## systolic_bp                        -0.211337474 -0.13452461 -0.64118520
## diastolic_bp                        0.762534634 -0.03402568  0.17515159
##                                           PC10         PC11
## diet_score                          0.32301295  0.001696639
## physical_activity_minutes_per_week  0.15266604 -0.018402305
## smoking_statusFormer                0.04286789 -0.701017602
## smoking_statusNever                 0.03740983 -0.703196641
## sleep_hours_per_day                 0.04209775  0.015501932
## bmi                                 0.75454504  0.025172997
## cholesterol_total                  -0.06220125 -0.005484053
## hdl_cholesterol                     0.32636874  0.113163421
## hba1c                               0.09576437 -0.003245414
## systolic_bp                        -0.39433662 -0.003712171
## diastolic_bp                       -0.15149387 -0.003597924
## Loadings for PC1 and PC2
round(le8.pca$rotation[, 1:2], 3)
##                                       PC1    PC2
## diet_score                         -0.220 -0.144
## physical_activity_minutes_per_week -0.123 -0.020
## smoking_statusFormer                0.260 -0.656
## smoking_statusNever                -0.290  0.643
## sleep_hours_per_day                 0.003 -0.015
## bmi                                 0.508  0.217
## cholesterol_total                   0.350  0.142
## hdl_cholesterol                    -0.273 -0.098
## hba1c                               0.264  0.138
## systolic_bp                         0.412  0.171
## diastolic_bp                        0.301  0.097
## Helper function for effect-size categories
effect_cat <- function(x) {
  ax <- abs(x)
  case_when(
    ax < 0.2 ~ "minimal",
    ax < 0.3 ~ "small",
    ax < 0.4 ~ "moderate",
    TRUE     ~ "large"
  )
}

## Table for interpreting PC1 and PC2
pc12_loadings <- as.data.frame(le8.pca$rotation[, 1:2])
pc12_loadings$PC1_effect <- effect_cat(pc12_loadings$PC1)
pc12_loadings$PC2_effect <- effect_cat(pc12_loadings$PC2)

pc12_loadings   
##                                             PC1         PC2 PC1_effect
## diet_score                         -0.219997571 -0.14437531      small
## physical_activity_minutes_per_week -0.122875936 -0.01992516    minimal
## smoking_statusFormer                0.259750879 -0.65628548      small
## smoking_statusNever                -0.289666467  0.64299188      small
## sleep_hours_per_day                 0.002983006 -0.01513587    minimal
## bmi                                 0.507718867  0.21718814      large
## cholesterol_total                   0.349584735  0.14189226   moderate
## hdl_cholesterol                    -0.273310582 -0.09777385      small
## hba1c                               0.264492952  0.13789308      small
## systolic_bp                         0.411923102  0.17055129      large
## diastolic_bp                        0.301318306  0.09702781   moderate
##                                    PC2_effect
## diet_score                            minimal
## physical_activity_minutes_per_week    minimal
## smoking_statusFormer                    large
## smoking_statusNever                     large
## sleep_hours_per_day                   minimal
## bmi                                     small
## cholesterol_total                     minimal
## hdl_cholesterol                       minimal
## hba1c                                 minimal
## systolic_bp                           minimal
## diastolic_bp                          minimal
biplot(le8.pca, scale = 0,
       xlab = "PC1",
       ylab = "PC2",
       main = "Biplot of LE8 PCA")

Comment: The PCA was conducted with centering and scaling so that all variables contributed on a comparable scale. Examination of the proportion of variance explained by each component showed that the first few components captured a moderate share of the variation (PC1 = 15.6%, PC2 = 14.6%, PC3 = 10.0%, PC4 = 9.4%). The cumulative variance explained increased steadily, reaching approximately 75.5% after the first seven components. Therefore, seven principal components are required to retain at least 75 percent of the total variation in the LE8 variables. The scree plot and cumulative variance plot both show this point clearly.

PC1 is primarily defined by the variables related to BMI, cholesterol control, and blood pressure, with several large or moderate loadings across these groups. BMI shows a large positive loading (0.508), indicating that higher BMI strongly increases PC1 scores. Systolic blood pressure (0.412) also contributes a large positive effect, and total cholesterol (0.350) and diastolic blood pressure (0.301) contribute moderate positive effects. HDL cholesterol shows a small negative loading (-0.273), suggesting lower HDL increases PC1. Factors representing tobacco exposure (smoking_statusFormer = 0.260; smoking_statusNever = -0.290) show small contributions but in opposite directions. Diet score (-0.220) and physical activity (-0.123) have small-to-minimal negative effects, and sleep and blood sugar measures (sleep_hours_per_day and hba1c) have minimal influence. Overall, PC1 represents a general “cardiometabolic risk” dimension, with higher PC1 scores reflecting higher BMI, higher blood pressure, and higher cholesterol levels, along with modest influence from smoking history.

PC2 PC2 is defined most strongly by smoking-related variables. Both tobacco indicators show large contributions but in opposite directions: smoking_statusFormer loads negatively (-0.657) and smoking_statusNever loads positively (0.643). This makes smoking behavior the dominant factor separating individuals along the PC2 axis. Diet score (-0.144), physical activity (-0.020), hba1c (0.138), and systolic/diastolic blood pressure have minimal influences. BMI (0.217) and total cholesterol (0.142) contribute small-to-minimal associations. Thus, PC2 primarily represents a “smoking behavior” axis, where higher PC2 values correspond to never-smokers, and lower values correspond to former smokers.

To summarize,

PC1 = a cardiometabolic risk profile (Dominated by BMI, blood pressure, and cholesterol, with moderate influence from smoking category. High PC1 scores reflect unfavorable metabolic and cardiovascular measures.)

PC2 = a smoking-status axis (Driven almost entirely by the two smoking indicator variables, distinguishing never-smokers from former smokers, with minimal contribution from the other LE8 components.)

The biplot displays three visible clusters because the PCA includes the smoking_status factor, which was converted into two indicator variables and implicitly represents three categories: never smokers, former smokers, and current smokers. These smoking indicators have the largest loadings on PC2, causing the three smoking groups to separate distinctly along that axis, while PC1 captures variation in cardiometabolic risk through BMI, blood pressure, and cholesterol levels, creating additional spread within each smoking category. As a result, the combination of a strongly separating categorical variable (smoking status) and continuous variation in metabolic health produces three natural clusters in the biplot.

Hierarchical Clustering

## Use the smallest number of PCs that explain at least 75 percent
pc_keep <- which(le8_cum_pve >= 0.75)[1]
pc_keep
## [1] 7
## PCA scores for the selected components
pc_scores <- le8.pca$x[, 1:pc_keep]

## Distance matrix
dist_pc <- dist(pc_scores)

## Hierarchical clustering with three linkage methods
hc_complete <- hclust(dist_pc, method = "complete")
hc_average  <- hclust(dist_pc, method = "average")
hc_single   <- hclust(dist_pc, method = "single")

## Plot dendrograms for each linkage
par(mfrow = c(1, 3))

plot(hc_complete,
     main = "Hierarchical Clustering (Complete Linkage)",
     xlab = "",
     sub  = "",
     cex  = 0.6)

plot(hc_average,
     main = "Hierarchical Clustering (Average Linkage)",
     xlab = "",
     sub  = "",
     cex  = 0.6)

plot(hc_single,
     main = "Hierarchical Clustering (Single Linkage)",
     xlab = "",
     sub  = "",
     cex  = 0.6)

par(mfrow = c(1, 1))

## Cut each dendrogram into 3 clusters and get cluster sizes
clust_complete3 <- cutree(hc_complete, k = 3)
clust_average3  <- cutree(hc_average,  k = 3)
clust_single3   <- cutree(hc_single,   k = 3)

table(clust_complete3)
## clust_complete3
##    1    2    3 
##  792 1964 2244
table(clust_average3)
## clust_average3
##    1    2    3 
## 4997    2    1
table(clust_single3)
## clust_single3
##    1    2    3 
## 4998    1    1
## Add complete-linkage cluster membership to the full dataset
df$cluster_complete3 <- factor(clust_complete3)

## Compute means of all numeric variables by cluster
cluster_means <- df %>%
  group_by(cluster_complete3) %>%
  summarise(
    across(
      where(is.numeric),
      ~ mean(.x, na.rm = TRUE)
    )
  )


df <- df %>%
  mutate(
    smoke_former = ifelse(smoking_status == "Former", 1, 0),
    smoke_never  = ifelse(smoking_status == "Never", 1, 0)
  )

cluster_means_le8 <- df %>%
  group_by(cluster_complete3) %>%
  summarise(
    across(
      c(
        "diet_score",
        "physical_activity_minutes_per_week",
        "sleep_hours_per_day",
        "bmi",
        "cholesterol_total",
        "hdl_cholesterol",
        "hba1c",
        "systolic_bp",
        "diastolic_bp",
        "smoke_former",
        "smoke_never",
        "diagnosed_diabetes"
      ),
      ~mean(.x, na.rm = TRUE)
    )
  )
## Warning: There were 3 warnings in `summarise()`.
## The first warning was:
## ℹ In argument: `across(...)`.
## ℹ In group 1: `cluster_complete3 = 1`.
## Caused by warning in `mean.default()`:
## ! argument is not numeric or logical: returning NA
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 2 remaining warnings.
print(cluster_means, width = Inf)
## # A tibble: 3 × 15
##   cluster_complete3   age physical_activity_minutes_per_week diet_score
##   <fct>             <dbl>                              <dbl>      <dbl>
## 1 1                  51.9                              179.        6.18
## 2 2                  53.3                               91.1       5.69
## 3 3                  45.9                              121.        6.08
##   sleep_hours_per_day screen_time_hours_per_day   bmi systolic_bp diastolic_bp
##                 <dbl>                     <dbl> <dbl>       <dbl>        <dbl>
## 1                7.29                      5.91  26.6        120.         75.0
## 2                6.78                      6.03  27.0        120.         77.2
## 3                7.14                      5.92  24.1        110.         73.8
##   heart_rate cholesterol_total hdl_cholesterol ldl_cholesterol triglycerides
##        <dbl>             <dbl>           <dbl>           <dbl>         <dbl>
## 1       68.7              197.            48.1           119.           126.
## 2       70.3              191.            51.9           110.           129.
## 3       69.5              177.            57.9            91.4          112.
##   hba1c
##   <dbl>
## 1  6.34
## 2  6.82
## 3  6.28
print(cluster_means_le8, width = Inf)
## # A tibble: 3 × 13
##   cluster_complete3 diet_score physical_activity_minutes_per_week
##   <fct>                  <dbl>                              <dbl>
## 1 1                       6.18                              179. 
## 2 2                       5.69                               91.1
## 3 3                       6.08                              121. 
##   sleep_hours_per_day   bmi cholesterol_total hdl_cholesterol hba1c systolic_bp
##                 <dbl> <dbl>             <dbl>           <dbl> <dbl>       <dbl>
## 1                7.29  26.6              197.            48.1  6.34        120.
## 2                6.78  27.0              191.            51.9  6.82        120.
## 3                7.14  24.1              177.            57.9  6.28        110.
##   diastolic_bp smoke_former smoke_never diagnosed_diabetes
##          <dbl>        <dbl>       <dbl>              <dbl>
## 1         75.0        0.174       0.535                 NA
## 2         77.2        0.254       0.526                 NA
## 3         73.8        0.160       0.705                 NA
## Ensure diagnosed_diabetes is coded as 0/1 or factor with two levels
## For safety, convert to factor here:
df$diagnosed_diabetes <- as.factor(df$diagnosed_diabetes)

## Fit logistic regression with cluster membership as predictor
glm_cluster <- glm(
  diagnosed_diabetes ~ cluster_complete3,
  data   = df,
  family = binomial
)

summary(glm_cluster)
## 
## Call:
## glm(formula = diagnosed_diabetes ~ cluster_complete3, family = binomial, 
##     data = df)
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         0.03536    0.07108   0.497    0.619    
## cluster_complete32  0.95741    0.08737  10.958   <2e-16 ***
## cluster_complete33 -0.07458    0.08268  -0.902    0.367    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6778.6  on 4999  degrees of freedom
## Residual deviance: 6500.2  on 4997  degrees of freedom
## AIC: 6506.2
## 
## Number of Fisher Scoring iterations: 4
## Predicted class based on 0.5 cutoff
prob_cluster <- predict(glm_cluster, type = "response")

## If factor levels are "0", "1", map probabilities to those labels
pred_cluster <- ifelse(prob_cluster >= 0.5, "1", "0")
pred_cluster <- factor(pred_cluster, levels = levels(df$diagnosed_diabetes))

## Confusion matrix comparing predicted vs observed
conf_mat_cluster <- confusionMatrix(
  data      = pred_cluster,
  reference = df$diagnosed_diabetes
)

conf_mat_cluster
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1144 1100
##          1  920 1836
##                                           
##                Accuracy : 0.596           
##                  95% CI : (0.5822, 0.6096)
##     No Information Rate : 0.5872          
##     P-Value [Acc > NIR] : 0.1057          
##                                           
##                   Kappa : 0.1773          
##                                           
##  Mcnemar's Test P-Value : 6.814e-05       
##                                           
##             Sensitivity : 0.5543          
##             Specificity : 0.6253          
##          Pos Pred Value : 0.5098          
##          Neg Pred Value : 0.6662          
##              Prevalence : 0.4128          
##          Detection Rate : 0.2288          
##    Detection Prevalence : 0.4488          
##       Balanced Accuracy : 0.5898          
##                                           
##        'Positive' Class : 0               
## 

Comment: After retaining the first seven principal components, which together explain at least 75 percent of the total variance, hierarchical clustering was performed on the PC score matrix using complete, average, and single linkage methods. The dendrograms for each linkage show visibly different clustering structures, reflecting how each method defines inter-cluster distance. When the trees were cut to form three clusters, the cluster sizes varied substantially depending on the linkage method. Complete linkage produced a relatively balanced distribution across the three clusters (792, 1964, and 2244 individuals), suggesting that it separated groups based on overall dissimilarity across all members. In contrast, average linkage yielded one extremely large cluster (n = 4997) and two very small clusters (n = 2 and n = 1), and single linkage exhibited the same pattern (n = 4998, 1, and 1), indicating severe chaining and poor separation of meaningful groups. These results show that complete linkage is the only method that forms interpretable and non-degenerate clusters in this dataset.

Using the complete-linkage solution with three clusters, the mean profiles of the variables revealed clear behavioral and cardiometabolic distinctions across clusters. Cluster 1 showed the highest physical activity levels (about 179 minutes per week), relatively favorable diet scores, and intermediate BMI and blood pressure, suggesting a more physically active and moderately healthy group. Cluster 2 had the lowest physical activity (around 91 minutes), the lowest diet scores, and the highest BMI, systolic and diastolic blood pressure, and triglycerides, representing a higher metabolic risk profile. Cluster 3 was younger on average, with physical activity levels between clusters 1 and 2, the lowest BMI, the highest HDL cholesterol, and the lowest total cholesterol and triglycerides, reflecting a comparatively healthier lipid and body composition profile. Smoking patterns also differed, with Cluster 3 having the highest proportion of never smokers and the lowest proportion of former smokers. Together, these results indicate that complete-linkage clustering separated individuals into groups that meaningfully differ in activity patterns, adiposity, lipid profiles, and related cardiometabolic characteristics.

Using cluster membership as the sole predictor in a logistic regression model resulted in modest discrimination of diagnosed diabetes. The model identified a significantly higher odds of diabetes in Cluster 2 compared with Cluster 1, while Cluster 3 did not differ meaningfully from the reference group. However, this limited separation among clusters translated into relatively low predictive performance. The overall accuracy was about 0.596, only slightly above the no-information rate of 0.587, and the Kappa statistic of 0.177 indicated poor agreement beyond chance. Sensitivity and specificity were balanced but modest, and the confusion matrix showed substantial misclassification in both directions.

Compared with the supervised learning methods in Problem 1, this accuracy is somewhat lower. In Problem 1, the classification tree achieved an accuracy of about 0.57, while the random forest models reached around 0.59–0.60, and logistic regression models using individual predictors achieved similar or slightly higher levels depending on variable selection. Because cluster membership is derived from unsupervised learning and does not directly optimize prediction of diagnosed diabetes, this reduced accuracy is expected. Overall, the clustering-based model does not outperform the supervised models, reinforcing that unsupervised grouping is not an effective predictor of diabetes in this dataset.